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t-h Abstract: We assess the coverage properties of confidence and credible intervals on the 

CMSSM parameter space inferred from a Bayesian posterior and the profile likelihood based 

t— I on an ATLAS sensitivity study. In order to make those calculations feasible, we introduce 

a new method based on neural networks to approximate the mapping between CMSSM 
parameters and weak-scale particle masses. Our method reduces the computational ef- 
fort needed to sample the CMSSM parameter space by a factor of ~ 10 4 with respect to 
conventional techniques. We find that both the Bayesian posterior and the profile likeli- 
hood intervals can significantly over-cover and identify the origin of this effect to physical 
boundaries in the parameter space. Finally, we point out that the effects intrinsic to the 
statistical procedure are conflated with simplifications to the likelihood functions from the 
experiments themselves. 
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1. Introduction 

Experiments at the Large Hadron Collider (LHC) will soon start testing many models 
of particle physics beyond the Standard Model (SM). Particular attention will be given 
to the Minimal Supersymmetric SM (MSSM) and other scenarios involving softly-broken 
supersymmetry (SUSY). 

We are interested in the statistical properties of the emerging methodology used to infer 
the parameters of these models. Here we consider a restricted class of SUSY models with 
certain universality assumptions regarding the SUSY breaking parameters. This scenario is 
commonly called mSUGRA or the Constrained Minimal Supersymmetric Standard Model 
(CMSSM) [1, 2]. The model is defined in terms of five free parameters: common scalar (w-o), 
gaugino (?n^ 2 ) and tri-linear (^4o) mass parameters (all specified at the GUT scale) plus the 
ratio of Higgs vacuum expectation values tan f3 and sign(/i), where fi is the Higgs/higgsino 
mass parameter whose square is computed from the conditions of radiative electroweak 
symmetry breaking (EWSB). 

A common procedure to explore the model's parameter space consisted of evaluating 
the likelihood function on a fixed grid, often encompassing only 2 or 3 dimensions at the 
time [3-8]. Of course, the number of likelihood evaluations in a grid scan scales expo- 
nentially with the dimensionality of the parameter space, making it impractical for a full 
exploration of the CMSSM parameter space. More recently new approaches based on both 
Frequentist and Bayesian statistics coupled with Markov Chain Monte Carlo (MCMC) 
methodology have been applied [9-20] . The efficiency of the MCMC techniques allow for 
a full exploration of multidimensional models. However, the likelihood function of these 
models is complex, multimodal with many narrow features, making the exploration task 
with conventional MCMC methods challenging often with low sampling efficiency. 
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More recently, a novel scanning algorithm, called MultiNest [21-23], has been proposed 
in the Bayesian context. It is based on the framework of Nested Sampling, recently invented 
by Skilling [24] . MultiNest has been developed in such a way as to be an extremely efficient 
sampler of the posterior distribution even for likelihood functions defined over a parameter 
space of large dimensionality with a very complex structure. It has been applied for the 
exploration of several models within the MSSM [23, 25-32] and its minimal extension, the 
so-called Next-to-Minimal Supersymmetric SM (NMSSM) [33]. Nested sampling has also 
been shown as a powerful technique for model reconstruction based on LHC data [34]. 

Having implemented sophisticated statistical and scanning methods, several groups 
have turned their attention to evaluating the sensitivity to the choice of priors [15, 23, 35] 
and of scanning algorithms [36]. Those analyses indicate that current constraints are not 
strong enough to dominate the posterior and that the choice of prior does influence the 
resulting inference. While confidence intervals derived from the profile likelihood or a chi- 
square have no formal dependence on a prior, there is a sampling artifact when the contours 
are extracted from samples produced from MCMC or MultiNest [23]. 

Given the sensitivity to priors and the differences between the intervals obtained from 
different methods, it is natural to seek out a quantitative assessment of their performance. 
A natural quantity for this evaluation is coverage: the probability that an interval will 
contain (cover) the true value of a parameter. The defining property of a 95% confidence 
interval is that the procedure that produced it should produce intervals that cover the 
true value 95% of the time; thus, it is reasonable to check if the procedures have the 
properties they claim. Coverage is a frequentist concept: one assumes the true parameters 
take on some specific, unknown values and considers the outcome of repeated experiments. 
Intervals based on Bayesian techniques are meant to contain a given amount of posterior 
probability for a single measurement and are referred to as credible intervals to make clear 
the distinction. While Bayesian techniques are not designed with coverage as a goal, it 
is still meaningful to investigate their coverage properties. In fact, the development of 
Bayesian methods with desirable frequentist properties is an active area of research for 
professional statisticians [37, 38]. Moreover, the frequentist intervals obtained from the 
profile likelihood or chi-square functions are based on asymptotic approximations and are 
not guaranteed to have the claimed coverage properties. 

In this work we study the coverage properties of both Bayesian and Frequentist pro- 
cedures commonly used to study the CMSSM. In particular, we consider the coverage of 
the ATLAS "SU3" CMSSM benchmark point. The task requires extensive computational 
expenditure, which would be unfeasible with standard analysis techniques. Thus in this 
paper we explore the use of a class of machine learning devices called Artificial Neural 
Networks (ANNs) to approximate the most computationally intensive sections of the anal- 
ysis pipeline. Such techniques have been successfully employed before in the cosmological 
setting to accelerate inference massively [39]. Within particle physics, ANNs have been 
used extensively for event classification, but we are not aware of their use in approximating 
the mapping between fundamental Lagrangian parameters and observable quantities such 
as a mass spectrum. 

This paper will be arranged as follows: in Section 2 we will briefly review the Bayesian 



- 2 - 



methodology and the CMSSM framework that this work operates within while in Section 3 
we will outline the use of neural networks to approximate SUSY spectrum calculators. In 
Section 3.1 we will demonstrate our method, discuss the accuracy of our network predictions 
against the dedicated spectrum calculator, and demonstrate computational performance 
improvements. In Section 4 we present the results of a coverage study employing the 
trained networks. We conclude with a discussion in Section 5. 

2. Statistical framework and super symmetry model 

The cornerstone of Bayesian inference is Bayes' Theorem, which reads 

,(e| d ) = *™ (2.i) 

The quantity p(Q\d) on the l.h.s. of eq. (2.1) is called the posterior, while on the r.h.s., 
the quantity p(d\Q) is the likelihood (when taken as a function for fixed data, d). The 
quantity p(0) is the prior which encodes our state of knowledge about the values of the 
parameters before we see the data. The state of knowledge is then updated to the 
posterior via the likelihood. Finally, the quantity in the denominator is called evidence or 
model likelihood. If one is interested in constraining the model's parameters, the evidence 
is merely a normalization constant, independent of 0, and can therefore be dropped (for 
further details, see e.g. [40]). 

We denote the parameter set of the CMSSM introduced above (mo, rni/ 2 , Aq and 
tan/3) by 9 (we fix sgn(/i) to be positive, motivated by arguments of consistency with 
the measured anomalous magnetic moment of the muon), while ip denotes the "nuisance 
parameters" whose values we are not interested in inferring, but which enter the calculation 
of the observable quantities. Here, the relevant nuisance parameters are the SM quantities 

V = {M t , m 6 K) M , a s (M z ) M , a em (M z )^} , (2.2) 

where Mt is the pole top quark mass, mf,(mf,) M is the bottom quark mass at m^, while 
a em (Mz) M and a s (Mz) M are the electromagnetic and the strong coupling constants at 
the Z pole mass Mz, the last three evaluated in the MS scheme. Note that there are 
no nuisance parameters associated with experimentally related systematics, which would 
instead be anticipated to be the case in future inference problems based on the results of 
LHC experiments. We denote the full 8-dimensional set of parameters by 

8 = (0». (2.3) 

The likelihood evaluation in a SUSY particle model analysis involves the computation 
of the mass spectrum. It implies the use of an iterative procedure in which the renormal- 
ization group equations (RGEs) are solved. The current public numerical codes reach a 
precision at the level of a few percent in computing the spectrum [41]. For it, the RGEs 
are implemented up two-loop level and the calculation of the physical masses involve the 
inclusion, at least, of the full one-loop radiative corrections. The typical running time 



- 3 - 




Figure 1: Physical (upper) and unphysical (lower) samples in too — Tn\ii space. 

for the spectrum calculator is a few seconds per model point, which can seriously hin- 
der all Monte-Carlo style explorations of the parameter space where 10 5 — 10 parameter 
points need to be examined. In addition these parameter spaces have a unphysical regions 
which emit tachyonic solutions and/or in which EWSB is not fulfilled. Figure 1 illustrates 
that these regions can be evenly spread across some projections of the 9 parameter space. 
Testing and discarding these unphysical points leads to large timing inefficiencies in the 
scan. 



3. Approximating the spectrum calculators with neural networks 

Any analysis of the CMSSM parameter space must relate the high-scale parameters 9 with 
observable quantities, such as the sparticle mass spectrum at the LHC. As described above, 
this is achieved by evolving the RGEs, which is a computationally intensive procedure. We 
use SoftSusy [42] for calculating the sparticle mass spectrum, and denote the predicted 
weak-scale sparticle masses by m. One can view the RGEs simply as a mapping from 
O — > m, and attempt to engineer a computationally efficient representation of the function. 
There is a vast literature for multivariate function approximation, which includes techniques 
such as neural networks [43], radial basis functions [44], support vector machines [45], and 
regression trees [46, 47]. 

Here we use a multilayer perceptron (MLPs), a type of feed- forward neural network. 
We restrict ourselves to three-layer MLPs, which consist of an input layer (O;), a hidden 
layer (hj), and an output layer (yj). In such a network, the value of the nodes in the hidden 
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Table 1: Parameter ranges chosen for CMSSM parameters 9 and nuisance parameters ip. We have 
adopted uniform priors on the variables and ranges shown here. 

CMSSM parameters: 6 
log(50) < logm ,logm 1//2 (GeV) < log(500) 
-4 < A (TeV) < 4 
2 < tan < 62 
SM nuisance parameters: ip 
3.92 < M t < 4.48 
163.7 < m 6 (m 6 ) M < 178.1 
0.1096 < a em {Mzf^ < 0.1256 
127.846 < a s {M z Y- MS ^ < 127.99 



and output layers are given by 

hidden layer: h 3 = g {1 \ff\ ff ] = £ u^S, + t,f\ (3.1) 

l 

outputlayer: Vi = 5 ( 2 >(/< 2) ); jf = £ w^hj + tf\ (3.2) 

j 

where the index I runs over input nodes, j runs over hidden nodes and i runs over output 
nodes. The functions and g^ are called activation functions and are chosen to be 
bounded, smooth and monotonic. The non-linear nature of the former is a key ingredient 
in constructing a viable network; we use g^ l \x) = tanhx and g^ 2 \x) = x. 

The weights w and biases £ are the quantities we wish to determine, which we denote 
collectively by a. As these parameters vary, a very wide range of non-linear mappings 
between the inputs and outputs are possible. In fact, according to a 'universal approxima- 
tion theorem' [48, 49], a standard multilayer feed- forward network with a locally bounded 
piecewise continuous activation function can approximate any continuous function to any 
degree of accuracy if (and only if) the network's activation function is not a polynomial 
and the network contains an adequate number of hidden nodes. This result applies when 
activation functions are chosen a priori and held fixed if a is allowed to vary. In practice, 
one often must empirically find a network architecture with a suitable balance of training 
requirements, generalization, and prediction accuracy. 

3.1 Network training, accuracy, and performance 

The procedure of adjusting the weights and biases of a network (and sometimes the ar- 
chitecture of the network itself) as to approximate the SoftSusy mapping is referred to 
as training, and requires the use of sophisticated algorithms. The training procedure is 
usually cast as an optimization problem with respect to some loss function C. A training 
data set V = {9^, mW} was used for this procedure, where k is an index for input- 
output pairs generated with SoftSusy. The training dataset was typically chosen to have 
a few thousand points for this application. If the training sample adequately represents the 
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function over the chosen region and the network has good generalization capabilities, then 
the trained network should approximate the function well on an independent testing sam- 
ple. Below we describe the training algorithm and assess its accuracy with an independent 
testing sample. 

Any training process requires the quantification of errors between the network outputs 
Hi and the corresponding values from the training via some loss function. In this application 
we chose a simple squared error objective function in the network parameters a: 



This is a highly non-linear, multi-modal function in hundreds or thousands of dimensions. 

The problem of network training is analogous to one of parameter estimation in a 
with Equation (3.3) playing the role of the log-likelihood function. For small networks 
this function can be adequately optimised using a form of gradient descent described as 
back propagation [50] . A complete Bayesian framework for the training of neural networks 
has been given by [51]. In this context the log of the posterior probability of a given the 
training data T> is: 



where II is a prior distribution of the network parameters a. While it is conceptually 
feasible to use conventional Monte Carlo sampling based methods such as Metropolis- 
Hastings or Nested Sampling to optimise this function and thus recover the optimal set of 
a, the computational expense of this approach would be prohibitively high. In this work 
we have used an optimising suite of software called MemSys based on the principle of 
Maximum Entropy. 

The MemSys algorithm chooses a positive-negative entropy function S as the prior 
distribution, 11(a) with a hyperparameter a which controls the relative balance between 
prior and likelihood within the posterior distribution [52] so that Equation (3.4) becomes: 



For each choice of a there is a set of network parameters d that maximises the posterior 
distribution. As a is varied one obtains a path of maximum posterior solutions d(a) called 
the maximum- entropy trajectory. The MemSys algorithm begins the exploration of the 
joint distribution of a and a by slowly introducing the likelihood term in the posterior 
with initially large values of a. For each sampled value of a the algorithm converges to the 
maximum posterior solution d via conjugate gradient descent. The parameter a is then 
slowly reduced via a rate parameter such that the new maximum entropy solution a' lies 
close to the previous solution d, thus ensuring a smooth descent to the optimal point in a 
and a. MemSys then returns an estimate of the maximum posterior network weights and 
biases. For more complete details of the MemSys algorithm we refer the reader to [53]. 

The training data was produced by uniformly sampling roughly 4000 points within the 
region of parameter space defined by Table 1. Note that the region of parameter space 
has been restricted to the vicinity of a SUSY benchmark point considered in Section 4. A 




(3.3) 



A.- 



]nP(a\V) oc -C(V\a) + In 11(a) 



(3.4) 



]nP(a\V) oc -C(V\a) + a In 5(a). 



(3.5) 
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simple pre-processing step was used to speed up the network optimization: this involves 
mapping all inputs and outputs linearly so they have zero mean and a variance of one- 
half. While MemSys provides algorithms for optimising the network architecture, for our 
function approximation use-case we found empirically that 10 hidden nodes provided a 
sufficiently accurate and computationally efficient network architecture. 

Figure 2 show correlation plots of the SoftSusy outputs with their neural network 
equivalents. Figures (a) - (1) illustrate the main mass spectrum components and all but 
m\ T (g) demonstrated excellent correlation coefficients of > 0.9999. The training problems 
for m\ T (g), where the best correlation coefficient achieved was no greater than 0.998, was 
overcome by breaking the problem up into more basic components. In this case we noted 
that m\ T and m 2 ST are eigenvalues of a (symmetric) 2x2 matrix. By asking the network 
instead to interpolate over the 3 distinct matrix elements (shown in figures (m) -(o)) and 
then perform the eigenvalue decomposition independently, it was possible to improve the 
final correlation on m\ T to 0.9999. 

The impact of replacing SoftSusy with the network is summarized in Fig. 3 and 4. 
While the approximate mapping will slightly modify the obtained posterior distributions 
(network noise), this must be compared to the intrinsic variability (sampling noise) in the 
results due to the finite sampling of techniques such as MCMC and MultiNest. Figure 3 
shows that the boundaries of univariate intervals from the two approaches tend to be 
compatible within about 2a of the sampling noise. Figure 4 shows two-dimensional scatter 
plots of these boundaries. Figure 5 compares the posterior obtained for a single run using 
the trained network with the posterior obtained with a run using SoftSusy, showing the 
excellent agreement between the two. 

The neural network has been trained to perform to high accuracy only in the limited 
parameter range given in Table 1, since this was sufficient for the specific benchmark point 
being considered here. If one wanted to consider other regions of the CMSSM parameter 
space (e.g., the focus point), one would have to extend the network training to include such 
regions. While we have not experimented with this, based on our experience we believe 
that this should be straightforward to do. 

3.2 A classification network for unphysical parameter values 

A very high proportion of points sampled from within the full parameter range are re- 
jected by SoftSusy as unphysical. Furthermore there is no trivial way to identify such 
points when sampling without resorting to computationally expensive calculations. This 
is demonstrated by the large degree of overlap seen in Figure 1 between physical (upper 
points, green) and unphysical (lower points, red) samples in the mo, plane. Such un- 
physical samples have no valid mass spectrum and so will lie outside of the trained region of 
the function approximation networks described in the previous section. Thus, we require a 
means to exclude these points while sampling. Neural networks can be trained to perform 
for classification such as this, with modification of the training objective function. In this 
section we will describe classification networks and how they have been implemented for 
this application. 
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The aim of classification is to place members of a set into subsets based on inherent 
properties or features of the individuals, given some pre-classified training data. Formally, 
classification can be summarised as finding a classifier C : — > C which maps an object 
from some (typically multi-dimensional) feature space G to its classification label C, which 
is typically taken as one of {1, -~,N} where N is the number of distinct classes. Thus the 
problem of classification is to partition feature space into regions, assigning each region a 
label corresponding to the appropriate classification. In our context, the aim is to classify 
points in the parameter space G into one of two classes, physical or unphysical (hence 
N = 2), given a set of training data T> = (6",^'), where t^ k ' is a vector of dimension N 
that encodes the class label by placing unity in the C th component and zero elsewhere. 

In building a classifier using a neural network, it is convenient to view the problem 
probabilistically. To this end we again consider a 3-layer MLP consisting of an input layer 
a hidden layer (hj), and an output layer (ui). In classification networks, however, the 
outputs are transformed according to the softmax procedure 

, _ e u - 

such that they are all non-negative and sum to unity. In this way can be interpreted 
as the probability that the input feature vector O belongs to the ith class. A suitable 
objective function for the classification problem is then 

£ ( b ) = EE^ )ln ^( 0(fc) ' b )' ( 3 - 7 ) 

k i 

where k is an index over the training dataset and i is an index over the N classes. One 
then wishes to choose network parameters b so as to maximise this objective function as 
the training progresses. Training of this network was again performed using the MemSys 
package as previously described. In our context, the network has just two (transformed) 
outputs u[ and u' 2 , which give the probabilities that the input G is physical or unphysical. 
We partition Q space, albeit with some loss of information, by labeling regions according 
to whether u\ or u' 2 exceeds some threshold value. 

The training data was taken as a set of uniform samples of roughly 30,000 prior samples 
equally divided between physical and unphysical status. Again a pre-processing step was 
carried out so that all inputs were re-mapped to have zero mean and variance of one-half to 
improve training efficiency. The number of hidden nodes was selected manually and nine 
hidden nodes were found to perform well. 

It is common in signal processing to graphically represent the efficiency of a classifier 
using a receiver operating characteristic (ROC) curve which plots the true positive rate 
(TPR) versus the false positive rate (FPR) for increments of the classifiers discrimination 
threshold. In this case the network outputs u\ can be considered probabilities and so one 
can set a probability threshold above which the classifier should record a positive detection 
of membership. Variation of this criterion allows classifiers to make a tradeoff between 
the FPR and the TPR. The left panel of figure 6 shows the ROC curve for the network 
classifier (solid line). An ideal classifier produces a step- function ROC curve with a TPR 
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Table 2: Fisher matrix for the ATLAS likelihood employed in the analysis. All entries have been 
multiplied by 10 3 and are in units of GeV~ 2 . 

of 1 for all values of the threshold whereas a random classifier (black line) always provides 
equal numbers of true and false positives. For this analysis, adequate results were obtained 
using a threshold of 0.5, which produces a TPR of 99.8% and a FPR of less than 10%. 
It is possible to obtain arbitrarily small FPR by reducing the threshold (as shown in the 
right panel of figure 6), however this does result in smaller values of the TPR. Figure 7 
demonstrates however that the precise choice of threshold has a minimal effect on the final 
posteriors obtained. 

4. Coverage of intervals at the ATLAS benchmark 

Recently the ATLAS collaboration has studied, the so-called "SU3" benchmark point, 
which is found in the bulk region, assuming an integrated luminosity of 1 fb _1 [54]. The 
analysis considered dilepton and lepton+jets final states from the decay chain qi — > x!>( — > 
l^P)q — > q and high-p-p and large missing transverse energy from the decay chain 

q~R — > X\Q- The aim was to reconstruct the CMSSM (mSUGRA in the ATLAS analysis) 
model parameters evaluating the accuracy in the reconstruction. For this they have done a 
MCMC exploration. In a recent work some of us have investigated the analogous constraints 
using the MultiNest algorithm and the profile likelihood ratio with and without constraints 
from dark matter relic abundance [34]. 

Table 2 shows the inverse of the covariance matrix (i.e., the Fisher matrix) for spar- 
ticle masses used in this study, which is based on the parabolic approximation of the 
log-likelihood function reported in Ref. [54]. This likelihood function is based on the mea- 
surement of edges and thresholds in the invariant mass distributions for various combina- 
tions of leptons and jets in final state of the selected candidate SUSY events. Note that the 
relationship between the sparticle masses and the directly observable mass edges is highly 
non-linear, so a Gaussian is likely to be a poor approximation to the actual likelihood 
function. Furthermore, these edges share several sources of systematic uncertainties, such 
as jet and lepton energy scale uncertainties, which are only approximately communicated 
in Ref. [54]. While noting that these approximations to the likelihood function may omit 
essential information in the data, we proceed to assess the coverage of the intervals in this 
idealized setting. 

In order to assess coverage, we not only need the likelihood function (P(d\Q) with d 
fixed), but we also need the ability to generate pseudo-experiments with G fixed. Here 
we introduce an additional simplification that P(d\@) is also a multivariate Gaussian with 



- 9 - 



the same covariance structure. For a multivariate Gaussian, one would expect the profile 
likelihood to have nearly perfect coverage properties, unless there is a breakdown in one of 
the regularity requirements present in Wilks's theorem [55]. Two of the requirements that 
are most likely to be violated (or lead to a slow convergence to the asymptotic properties 
assured by Wilks's theorem) are related to the presence of boundaries on the parameter 
space or parameters that have no impact on the likelihood function in certain regions of 
the parameter space. As discussed above, excluding unphysical regions from the CMSSM 
parameter space imposes complicated boundary conditions. Similarly, in some regions of 
parameter space, certain interactions can be highly suppressed (either due to quantum 
effects or phase space considerations) , in which case some parameters may have little effect 
on the likelihood function. In the case of four CMSSM parameters and four non-degenerate 
measurements, the latter is not an issue, but boundary effects may be relevant. 

In addition to the complications due to the structure of the model itself, there are 
other issues which could spoil the expected coverage properties. First, some algorithmic 
artifact or shortcoming associated with the procedure that creates the intervals. These are 
complex, high dimensional problems, so it is possible that the algorithms may not perform 
in practice as well as they should in principle. Secondly, if the likelihood function is not 
strong enough to overcome the prior and they prefer different regions of parameter space, 
one may expect poor coverage properties. 

The sensitivity to priors and comparisons with the profile likelihood ratio were consid- 
ered for the same ATLAS benchmark point in Ref. [34]; reasonable agreement between the 
different approaches was found, indicating the likelihood function is dominating the infer- 
ence. Thus, we focus our attention on effects from algorithmic artifacts and the complicated 
boundaries of the physical CMSSM. 

In order to isolate algorithmic effects, we begin by constructing intervals in the weak 
scale masses m (as opposed to the more fundamental CMSSM parameters 9). The likeli- 
hood function is invariant under these reparametrizations, and it is trivial to extend the 
domain of the weak scale masses so that boundary effects are absent. In total 10 4 pseudo- 
experiments were analyzed using MCMC using 10 6 samples each. A flat prior was used on 
the space of m. From the MCMC samples, equal-tail and shortest credible intervals were 
constructed by marginalizing the posterior distribution to each of the four parameters. Ad- 
ditionally, profile likelihood intervals were constructed by maximizing the likelihood across 
the samples. Figure 8 shows that each of the restricted 1-d intervals covers with the stated 
rate for each of the methods. Thus, we conclude that our MCMC algorithm is performing 
as expected for this likelihood function. 

To assess the coverage in the restricted space of the physically realizable CMSSM 
requires several additional ingredients 1 . First, we must introduce the mapping — > m 
and the boundaries on corresponding to physically realizable parameter points. It is not 

x As this work was being finalized, we learnt of a similar study being carried out in the context of a 
coverage study of the CMSSM from future direct detection experiments [56] . While the spirit of the present 
paper and of Ref [56] is broadly similar, our approach takes advantage of the large computational speed-up 
afforded by neural networks. Both works clearly show the timeliness and importance of evaluating the 
coverage properties of the reconstructed intervals for future data sets. 
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computationally feasible to use SoftSusy for so many repeated likelihood evaluations, 
thus we rely on the neural network approximation to the mapping and a second network 
for classification of the physical points. Additionally, we must specify a prior on the space 
of 0; for this we used log priors in mo,?rti/2 and note there was little sensitivity to this 
choice in previous studies based on the same ATLAS likelihood function [34] . We construct 
10 4 pseudo-experiments and analyze them with both MCMC (using a Metropolis-Hastings 
algorithm) and MultiNest. Each realization is run as an independent chain until we gather 
a total of 10 6 posterior samples in the MCMC chain. For MultiNest, we run until the 
tolerance criterium is satisfied (with a tolerance parameter 0.5), which results in about 50k 
posterior samples. Altogether, our neural network MCMC runs have performed a total of 
4 x 10 10 likelihood evaluations, in a total computational effort of approximately 2 x 10 4 
CPU-minutes. We estimate that the same computation using SoftSusy would have taken 
about 1100-CPU years, which is at the boundary of what is feasible today, even with a 
massive parallel computing effort. By using the neural network, we observe a speed-up 
factor of about 3 x 10 4 compared with scans using the explicit spectrum calculator. 

As above, the posterior samples were used to form equal-tailed, central, and profile- 
likelihood based intervals for each of the four fundamental CMSSM parameters. Coverage 
was then computed for each type of interval. In order to quantify the uncertainty in 
the coverage introduced by sampling noise (i.e., the finite length of our MCMC chains) 
and by the neural network approximation to the full RGE calculation, we proceeded as 
follows. First, for each lower/upper limit (at a given confidence level) we built the quantity 
ctot = { a sN + °nn) 1//2 ' wnere o'SN is the sampling noise standard deviation while ctnn is 
the standard deviation from the neural network noise for that limit, both of which have 
been estimated by the pseudo-experiments of section 3.1. We then shifted the intervals 
uniformly downwards or upwards (keeping the intervals' length constant) by an amount 
given by the appropriate <7tot- This leads to a higher/lower coverage (depending on the 
direction of the shift and the limit being considered), which we used as an estimation of 
the standard deviation of the coverage value itself. This quantity is shown as an errorbar 
on the coverage values. We note that because of the large number of pseudo-experiments 
employed, the statistical noise from the binomial process itself is negligible compared with 
the above uncertainties. For the 2-d intervals we proceeded in a similar fashion, with the 
direction of the shift (in the 2-d plane) chosen to be parallel to the line connecting the 
mean value of the 2-d upper/lower limit (i.e., this is the most conservative choice for the 
error). 

The results are shown in Fig. 9, where it can be seen that the methods have substantial 
over-coverage (are conservative). The coverage of 2-d intervals is also shown in Fig. 10. 

4.1 Discussion of coverage results 

While it is difficult to unambiguously attribute the over-coverage to a specific cause, the 
most likely cause is the effect of boundary conditions imposed by the CMSSM. An example 
of such boundaries and how they compare with the extent of the likelihood function is shown 
in Fig. 11. This figure shows regions in some of the weak-scale mass combinations entering 
the likelihood which can be realized within the CMSSM (green shades). The red ellipses are 
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la contours from the likelihood, and are centered around the benchmark value. It is clear 
that likelihood extends beyond the boundaries imposed by the CMSSM, which leads to a 
slow convergence of the profile likelihood ratio to the asymptotic chi-square distribution. 

Recall that the profile likelihood ratio is defined as X(9) = L(9, ijj)/L(0, if)), where ip is 
the conditional maximum likelihood estimate (MLE) of ip with fixed, and 6, if) are the 
unconditional MLEs. When the fit is performed in the space of the weak-scale masses, there 
are no boundary effects, and the distribution of — 21nA(m) (when m is true) is distributed 
as a chi-square with a number of degrees of freedom given by the dimensionality of m. 
Since the likelihood is invariant under reparametrizations, we expect — 21nA(#) to also be 
distributed as a chi-square. If the boundary is such that m(6, ip) ^ rh or m(9, t/j) ^ m, 
then the resulting interval will modified. More importantly, one expects the denominator 
L(0,ifj) < L(rh) since m is unconstrained, which will lead to — 21nA(#) < — 21nA(m). In 
turn, this means more parameter points being included in any given contour, which leads 
to over-coverage. Evidence for this effect can be seen in Fig. 12, which shows the value of 
—2 In A for the scans in m and 0. 

The impact of the boundary on the distribution of the profile likelihood ratio is not 
insurmountable. It is not fundamentally different than several common examples in high- 
energy physics where an unconstrained MLE would like outside of the physical parameter 
space. Examples include downward fluctuations in event-counting experiments when the 
signal rate is bounded to be non-negative. Another common example is the measurement 
of sines and cosines of mixing angles that are physically bounded between [—1, 1], though 
an unphysical MLE may lie outside this region. The size of this effect is related to the 
probability that the MLE is pushed to a physical boundary. If this probability can be 
estimated, it is possible to estimate a corrected threshold on —2 In A. For a precise threshold 
with guaranteed coverage, one must resort to a fully frequentist Neyman Construction. 

5. Summary and conclusions 

We have presented a coverage study of state of the art techniques used for supersymmetric 
parameter estimation based on realistic toy data from an ATLAS sensitivity study. The 
techniques employ both MCMC and nested sampling to explore the parameter space and 
the posterior samples can be used to provide both Bayesian credible intervals and pro- 
file likelihood-based frequentist confidence intervals. We have tested our coverage analysis 
pipeline in a simple Gaussian toy model with the same dimensionality as the supersym- 
metric parameter space under consideration, and have confirmed the reliability of both 
the Bayesian posterior and profile likelihood intervals in this setting obtained with nested 
sampling and MCMC. 

The likelihood function used in these studies is a simplified representation of the actual 
likelihood based on the information in the ATLAS publication. We note that in order to 
perform coverage studies, one not only needs the ability to evaluate the likelihood function 
based on the pseudo-experiment, but also the ability to generate the pseudo-experiment 
itself. Both the representation of the likelihood function and the ability to generate pseudo- 
experiment are now possible with the workspace technology in RooFit/RooStats [57]. We 



- 12 - 



encourage future experiments to publish their likelihoods using this technology. Further- 
more, we point out that these likelihood functions should include the nuisance parameters 
associated with experimental systematics, since they may be correlated across the different 
measurements entering a global fit. 

We observe good coverage properties for intervals based directly on the weak-scale 
sparticle masses, confirming the reliability of both the Bayesian credible intervals and 
profile likelihood intervals obtained with nested sampling and MCMC in this setting. In 
contrast, we observe significant over-coverage in the fundamental CMSSM parameters, 
which we attribute to boundaries in the parameter space imposed by requirements of 
physically realizable theories. Our findings are specific to the benchmark point considered 
in our analysis, and coverage properties of benchmark points in different regions of the 
CMSSM parameter space might differ substantially. Currently, there is no known way 
to generalize the results, thus coverage requires a case-by-case analysis. The technique 
presented in this paper is, however, readily applicable to any benchmark point, provided a 
suitable likelihood function for LHC data is available, and that the neural network training 
is extended appropriately. 

Coverage studies are computationally expensive to perform, and the most intensive 
step for an individual likelihood evaluation in this context is the evaluation of the RGEs 
in the spectrum calculators. Thus, we have introduced a method to approximate the spec- 
trum calculators with neural networks. We observe consistency in the obtained intervals 
with speed-ups of 10 4 by utilizing neural network approximations. In principle, the same 
techniques can be applied to accelerate the computation of other observables, such as the 
relic density. 

In order to assess the actual coverage properties of both Frequentist and Bayesian 
intervals obtained from future data, it will be unavoidable to perform detailed studies 
using a large number of pseudo-experiments. This paper demonstrated for the first time 
how neural network techniques can be employed to achieve a very significant speed-up with 
respect to conventional spectrum calculators. Such methods are necessary in order to cope 
with the very large number of likelihood evaluations required in coverage studies. Our 
work is but a first step towards making accelerated inference techniques more widespread 
and more easily accessible to the community. 
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Figure 2: Correlation plots comparing the mass spectrum [GeV] output of SoftSUSY with the 
neural network approximations NN. Subhgures (m)-(o) show the components of the mass matrix 
dcfming m]^ . 
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Figure 3: Evaluation of network noise. Horizontal red (green) lines denote error bands for the 
upper and lower boundaries of 68% (95%) intervals for the CMSSM parameters. The red (green) 
lines denote ±ler (±2er) from 100 runs of the SuperBayes fitting package with SoftSusy and 
quantify numerical/sampling noise. Units of the vertical axis have been rescaled appropriately 
for display purposes, so that all error bands have the same width. The blue error bars are the 
corresponding 2<r error bars based on 10 4 runs using the neural network in place of SoftSusy 
(with the corresponding rescaling). 
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Figure 4: Evaluation of network noise. In the upper (lower) row, red dots give the location of 
the upper (lower) 68% and 95% intervals for 10 4 runs using the neural network, while blue dots are 
the locations from 100 runs of the SuperBayes fitting package with SoftSusy. 
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Figure 5: Comparison of Bayesian posteriors between runs of SuperBayes with SoftSusy (black 
lines, giving 68% and 95% regions) and neural networks (blue lines and corresponding filled regions) 
obtained using MultiNest, for a typical reconstruction. The agreement between the two methods is 
excellent, within numerical noise. The red diamond gives the true value for the benchmark point 
adopted. 



DC 




FPR criterion threshold 



Figure 6: Left panel: Receiver Operating Characteristic (ROC) curve of the neural network 
classifier for physical and unphysical regions of parameter space (red curve) and a random classifier 
(black curve). Right panel: False Positive Rate (FPR) as a function of criterion threshold. 
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Figure 7: Comparison of Bayesian posteriors between runs of SuperBayes with SoftSusy (black 
contours, giving 68% and 95% regions) and neural networks for different values of the classification 
threshold adopted (coloured contours, threshold value according to the legend). The agreement 
between all posteriors (within numerical noise) demonstrates the classifier's resiliance to changes in 
the threshold value. 
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Figure 8: Test of coverage for a toy 8D MCMC model. Green (red) is for the nominal 68% 
(95%) error. The recovered coverage is compatible with the exact value within the errors due to 
numerical/sampling noise. Error bars denote the standard deviation from 10 4 reconstructions with 
MCMC. 
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Figure 9: Coverage for various types of intervals for the CMSSM parameters, from 10 4 realizations. 
The top row employs MCMC for the reconstruction (each pseudo-experiment is reconstructed with 
10 6 samples), while to bottom row uses MultiNest (with 5 x 10 4 samples per pseudo-experiment). 
Green (red) is for the 68% (95%) error. 
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Figure 10: Coverage for various types of two-dimensional intervals for the CMSSM parameters, 
from 10 4 realizations based on MCMC (with 10 6 samples per pseudo-experiment). Green (red) is 
for the 68% (95%) error. 
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Figure 11: Illustration of the boundary effects coming from the physically realizable CMSSM 
(green region) with respect to the unconstrained likelihood expressed in the weak-scale masses (red 
ellipses). The likelihood extends beyond the boundaries imposed by the CMSSM, which leads to a 
slow convergence of the profile likelihood ratio to the asymptotic chi-square distribution. 
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Figure 12: The distribution of —2 In A for the scans in m (top row) and 8 (bottom row) from 
repeated realizations of pseudo-experiments. The solid line is a \ 2 distribution with 1 dof expected 
from Wilks' theorem. The distributions from scans in the weak-scale masses (top row) follow quite 
closely the % 2 distribution, while this is not the case for the scans using the CMSSM parameters as 
fundamental quantities. 



- 24 - 



